(54) Continuous speech recognition

(57) A method for use in recognizing speech in which signals are accepted corresponding to interspersed speech elements including text elements corresponding to text to be recognized and command elements to be executed. The elements are recognized. Modification procedures are executed in response to recognized predetermined ones of the command elements. The modification procedures include refraining from training speech models when the modification procedures do not correct a speech recognition error. In another aspect, the modification procedures include simultaneously modifying previously recognized ones of the text elements.
Description

This invention relates to continuous speech recognition. 

Many speech recognition systems, including DragonDictate™ from Dragon Systems™ of West Newton, Massachusetts, store data representing a user's speech (i.e., speech frames) for a short list of words, e.g., 32, just spoken by the user. If a user determines that a word was incorrectly recognized, the user calls up (by keystroke, mouse selection, or utterance) a correction window on a display screen. The correction window displays the short list of words or a portion of the short list of words, and the user selects the misrecognized word for correction. Selecting a word causes the speech recognition system to re-recognize the word by comparing the stored speech frames associated with the word to a vocabulary of speech models. The comparison provides a choice list of words that may have been spoken by the user and the system displays the choice list for the user. The user then selects the correct word from the choice list or the user verbally spells the correct word in the correction window. In either case, the system replaces the incorrect word with the correct word and adapts (i.e., trains) the speech models representing the correct word using the associated speech frames.

For more information on training speech models, see United States Patent No. 5,027,406, entitled "Method for Interactive Speech Recognition and Training", and United States Patent Application Serial No. 08/382,752, entitled "Apparatuses and Methods for Training and Operating Speech Recognition Systems", which are incorporated by reference. For more information on choice lists and alphabetic prefiltering, see United States Patent No. 4,783,803, entitled "Speech Recognition Apparatus and Method", United States Patent No. 4,866,778, entitled "Interactive Speech Recognition Apparatus", and United States Patent No. 5,027,406, entitled "Method for Interactive Speech Recognition and Training", which are incorporated by reference.

Aside from correcting speech recognition errors, users often change their minds regarding previously entered text and want to replace one or more previously entered words with different words. To do this editing, users frequently call up the correction window, select a previously entered word, and then type or speak a different word. The system replaces the previously entered word with the different word, and, because training is continuous, the system also adapts the speech models associated with the different word with the speech frames from the original utterance. This "misadaptation" may degrade the integrity of the speech models for the different word and reduce speech recognition accuracy.

For example, the user may have entered "It was a rainy day" and may want the text to read "It was a cold day." If the user calls up the correction window, selects the word "rainy" and types in or speaks the word "cold", the system replaces the word "rainy" with the word "cold" and misadapts the speech models for "cold" with the speech models for "rainy".

If the speech recognition system misrecognizes one or more word boundaries, then the user may need to correct two or more words. For example, if the user says "let's recognize speech" and the system recognizes "let's wreck a nice beach," then the user needs to change "wreck a nice beach" to "recognize speech." The user may call up the correction window and change each word individually using the choice list for each word. For example, the user may call up the correction window and select "wreck" as the word to be changed and choose "recognize" from the choice list (if available) or enter (by keystroke or utterance: word or spelling) "recognize" into the correction window. The user may then select and reject (i.e., delete) "a" and then "nice", and lastly the user may select "beach" and choose "speech" from the choice list or enter "speech" into the correction window.

Alternatively, after the user has called up the correction window and chosen "recognize", some speech recognition systems permit the user to enter a space after "recognize" to indicate to the system that another word correction follows. The system re-recognizes the speech frames following the newly entered word "recognize" and provides a hypothesis (e.g., "speech") and a corresponding choice list for the user. The user chooses either the hypothesis or a word from the choice list and may again follow that word with a space to cause the system to re-recognize a next word.

Other speech recognition systems have large storage capabilities that store all speech frames associated with user utterances and record all user utterances. The user may select a previously spoken word to have the system play back the user's original utterance. If the utterance does not match the recognized word (i.e., the system misrecognized the word), then the user may call up a correction window and type or speak the correct word to have the system make the correction and train the speech models for the corrected word. This may reduce speech model misadaptation by requiring the user to determine whether the system actually misrecognized the word before speech models are trained.

In general, in one aspect, the invention features a method for use in recognizing speech. Signals are accepted corresponding to interspersed speech elements including text elements corresponding to text to be recognized and command elements to be executed. The elements are recognized. Modification procedures are executed in response to recognized predetermined ones of the command elements. The modification procedures include refraining from training speech models when the modification procedures do not correct a speech recognition error.

In general, in another aspect, the modification procedures include simultaneously modifying previously recognized 
ones of the text elements. 
Implementations of the invention may include one or more of the following features. Text element boundaries (e.g., misrecognized boundaries) of the previously recognized ones of the text elements may be modified. Executing the modification procedures may include detecting a speech recognition error, and training speech models in response to the detected speech recognition error. The detecting may include determining whether speech frames or speech models corresponding to a speech recognition modification match at least a portion of the speech frames or speech models corresponding to previous utterances. Matching speech frames or speech models may be selected. The predetermined command elements may include a select command and an utterance representing a selected recognized text element to be corrected. The selected recognized text element may be matched against previously recognized text elements. Previously recognized text elements may be parsed and a tree structure may be built that represents the ordered relationship among the previously recognized text elements. The tree structure may reflect multiple occurrences of a given previously recognized one of the text elements. The utterance may represent a sequence of multiple selected recognized text elements. One of the recognized text elements may be modified based on correction information provided by a user speaking substitute text. The correction information may include correction of boundaries between text elements. The modification procedures may include modifying one or more of the most recently recognized text elements.

The predetermined command elements may include a command (e.g., "oops") indicating that a short term correction is to be made. The modification procedures may include interaction with a user with respect to modifications to be made. The interaction may include a display window in which proposed modifications are indicated. The interaction may include a user uttering the spelling of a word to be corrected. The modification procedures may include building a tree structure grouping speech frames corresponding to possible text elements in branches of the tree. The most recently recognized text elements may be re-recognized using the speech frames of the tree structure. The tree may be used to determine, text element by text element, a match between a correction utterance and the originally recognized text elements. The modification procedures may include, after determining a match, re-recognizing subsequent speech frames of an original utterance. If no match is determined, the recognized correction utterance may be displayed to the user. The command may indicate that the user wishes to delete a recognized text element. The text element may be the most recently recognized text element.

The predetermined command may be "scratch that". The command may be followed by a pause and the most recently recognized text element may then be deleted. The command may be followed by an utterance corresponding to a substitute text element and the substitute text element is then substituted for the most recently recognized text element.

The advantages of the invention may include one or more of the following. Providing the user with a variety of editing/correcting techniques allows the user to choose how they will edit or correct previously entered text. The technique chosen may depend upon the edit or correction to be made or the user may choose the technique with which they are most comfortable. The different techniques also allow users flexibility as to when changes or corrections are made. For example, the user may edit continuously while dictating text or the user may dictate an entire document before going back to make changes or corrections. Furthermore, the user's cognitive overhead for correcting and editing previously entered text is reduced. For instance, speech models may be trained only when the speech recognition system, not the user, determines that a word or series of words has been misrecognized. Similarly, in response to a user's correction, the system may automatically modify word boundaries to simultaneously change a first number of words into a second number of different words.

Other advantages and features will become apparent from the following description and from the claims. 

Fig. 1 is a block diagram of a speech recognition system.

Fig. 2 is a block diagram of speech recognition software and application software.

Fig. 3 is a block diagram of speech recognition software and vocabularies stored in memory.

Fig. 4 is a computer screen display of word processing command words and sentences.

Fig. 5 is a flow chart depicting a long term editing feature.

Figs. 6 and 7 are block diagrams of long term editing feature tree structures.

Figs. 8a-8f are computer screen displays depicting the long term editing feature.

Fig. 9 is a flow chart depicting a short term error correction feature.

Figs. 10a-10e are computer screen displays depicting a short term speech recognition error correction feature.

Fig. 11 is a computer screen display of a correction window and a spelling window.

Figs. 12 and 13 are block diagrams of short term error correction feature tree structures.

Fig. 14 is a flow chart depicting a "scratch that" editing feature.

Figs. 15a-15d show user interface screens.

The speech recognition system includes several correction/editing features. Using one correction feature, termed "short term error correction," the user calls up (by keystroke, mouse selection, or utterance, e.g., "oops") a correction window and enters (by keystroke or utterance) one or more previously spoken words to correct a recently misrecognized utterance. The system compares speech models (for typed words) or speech frames (for spoken words) associated with the correction against the speech frames of a predetermined number, e.g., three, of the user's previous utterances. If the comparison locates speech frames corresponding to a portion of one of the user's previous three utterances that substantially match the user's correction, then the system modifies the original utterance to include the correction. The modification of the original utterance includes re-recognizing the speech frames around the correction. As a result, a user may simultaneously correct one word, a series of words, or an entire utterance, including correcting misrecognized word boundaries. The speech frames from the original utterance are also used to train (i.e., adapt) the speech models for the correction.

If the comparison does not locate speech frames corresponding to a portion of one of the user's previous three utterances that substantially match the user's correction, then the system notifies the user that the correction cannot be made. For example, if the user erroneously enters one or more different words as a correction, the comparison will not locate corresponding speech frames in one of the user's previous three utterances. This reduces the possibility of misadaptation of speech models.

Another feature, termed "long term editing," allows the user to select and modify previously entered text. After selecting text through keystrokes or mouse selection or by speaking the words to be selected, the user modifies the selected text by typing or speaking replacement words. The user may simultaneously modify one word or a series of words.

Because the user may use long term editing to edit previously entered text or to correct speech recognition errors, the system does not automatically train the speech models for the modifications, which substantially prevents misadaptation of speech models. The user may, however, request that the system train the speech models for a modification.

A correction/editing feature, termed "scratch that and repeat", allows the user to quickly and easily delete, or delete and replace, the most recent utterance. After speaking an utterance, if the user determines that the system did not correctly recognize the previous utterance, the user selects (by keystroke, mouse selection, or utterance, e.g., "scratch that") a scratch command and repeats the utterance. The system replaces the words recognized from the original utterance with the words recognized from the repeated utterance. If the user wants to delete the words of the previous utterance, the user enters the scratch that command alone (e.g., followed by silence), and if the user wants to edit the words of the previous utterance, the user speaks "scratch that" followed by new text. In any case, the system does not train speech models in accordance with any replacement text, which reduces the possibility of misadaptation of speech models.
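
By way of illustration only, the "scratch that" behavior described above might be sketched in Python as follows. The function name `apply_utterance` and the representation of prior utterances as a list of strings are assumptions made for this sketch and are not taken from the described system. No training step appears, consistent with the statement that replacement text does not train speech models.

```python
from typing import List

SCRATCH_COMMAND = "scratch that"  # command phrase named in the text


def apply_utterance(utterances: List[str], recognized: str) -> List[str]:
    """Return the utterance history after handling one recognized utterance.

    `utterances` holds the text of each earlier utterance, oldest first
    (a hypothetical representation chosen for this sketch).
    """
    text = recognized.strip()
    rest = text[len(SCRATCH_COMMAND):]
    is_scratch = text.lower().startswith(SCRATCH_COMMAND) and (
        not rest or rest[0].isspace()
    )
    if not is_scratch:
        # Ordinary dictation: keep it as a new utterance.
        return utterances + [text]
    remaining = utterances[:-1]  # drop the most recent utterance, if any
    replacement = rest.strip()
    if replacement:
        # "scratch that" followed by new text: substitute the new text.
        remaining = remaining + [replacement]
    return remaining
```

For instance, passing a history of one utterance and the recognized text "scratch that" alone yields an empty history, while "scratch that" followed by new words yields a history containing only the new words.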

Referring to Fig. 1, a typical speech recognition system 10 includes a microphone 12 for converting a user's speech into an analog data signal 14 and a sound card 16. The sound card includes a digital signal processor (DSP) 19 and an analog to digital (A/D) converter 17 for converting the analog data signal into a digital data signal 18 by sampling the analog data signal at about 11 kHz to generate 220 digital samples during a 20 ms time period. Each 20 ms time period corresponds to a separate speech frame. The DSP processes the samples corresponding to each speech frame to generate a group of parameters associated with the analog data signal during the 20 ms period. Generally, the parameters represent the amplitude of the speech at each of a set of frequency bands.

The DSP also monitors the volume of the speech frames to detect user utterances. If the volume of three consecutive speech frames within a window of five consecutive speech frames (i.e., three of the last five speech frames) exceeds a predetermined speech threshold, for example, 20 dB, then the DSP determines that the analog signal represents speech and begins sending batches of speech frames via a digital data signal 23 to a central processing unit (CPU) 20. The DSP asserts an utterance signal (Utt) 22 to notify the CPU each time a batch of speech frames representing an utterance is sent via the digital data signal.

When an interrupt handler 24 on the CPU receives assertions of Utt signal 22, the CPU's normal sequence of execution is interrupted. Interrupt signal 26 causes operating system software 28 to call a store routine 29. Store routine 29 stores the incoming batch of speech frames into a buffer 30. When fourteen consecutive speech frames within a window of nineteen consecutive speech frames fall below a predetermined silence threshold, e.g., 6 dB, then the DSP stops sending speech frames to the CPU and asserts an End_Utt signal 21. The End_Utt signal causes the store routine to organize the batches of previously stored speech frames into a speech packet 39 corresponding to the user utterance.

Interrupt signal 26 also causes the operating system software to call monitor software 32. Monitor software 32 keeps a count of the number of speech packets stored but not yet processed. An application 36, for example, a word processor, being executed by the CPU periodically checks for user input by examining the monitor software's count. If the count is zero, then there is no user input. If the count is not zero, then the application calls speech recognizer software 38 and passes a pointer 37 to the address location of the speech packet in buffer 30. The speech recognizer may be called directly by the application or may be called on behalf of the application by a separate program, such as DragonDictate™ from Dragon Systems™ of West Newton, Massachusetts, in response to the application's request. For more information, see United States Patent No. 5,027,406, entitled "Method for Interactive Speech Recognition and Training", which is incorporated by reference.
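
As a non-authoritative illustration, the frame arithmetic and the two volume thresholds described above might be sketched in Python as follows. The function `segment_utterances` and its list-of-indices output are hypothetical. The sketch also makes two simplifying assumptions: it counts frames over the threshold among the last five (and frames under the threshold among the last nineteen) without requiring them to be adjacent, and it starts each utterance at the frame where the speech condition is first met.

```python
from collections import deque
from typing import Iterable, List

SAMPLE_RATE_HZ = 11_000  # "about 11 kHz" in the text
FRAME_MS = 20            # one speech frame per 20 ms
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 220 samples

SPEECH_DB, SPEECH_HITS, SPEECH_WINDOW = 20.0, 3, 5
SILENCE_DB, SILENCE_HITS, SILENCE_WINDOW = 6.0, 14, 19


def segment_utterances(frame_volumes_db: Iterable[float]) -> List[List[int]]:
    """Group frame indices into utterances using the two volume thresholds."""
    loud = deque(maxlen=SPEECH_WINDOW)    # recent "above speech threshold" flags
    quiet = deque(maxlen=SILENCE_WINDOW)  # recent "below silence threshold" flags
    utterances: List[List[int]] = []
    current = None                        # frames of the utterance in progress
    for index, volume in enumerate(frame_volumes_db):
        if current is None:
            loud.append(volume > SPEECH_DB)
            if sum(loud) >= SPEECH_HITS:
                current = [index]         # speech detected: open an utterance
                quiet.clear()
        else:
            current.append(index)
            quiet.append(volume < SILENCE_DB)
            if sum(quiet) >= SILENCE_HITS:
                utterances.append(current)  # silence detected: close it
                current = None
                loud.clear()
    if current is not None:
        utterances.append(current)        # input ended mid-utterance
    return utterances
```

Each inner list plays the role of one speech packet: the frames accumulated between the speech condition and the silence condition.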

Referring to Fig. 2, to determine what words have been spoken, speech recognition software 38 causes the CPU to retrieve speech frames within speech packet 39 from buffer 30 and compare the speech frames (i.e., the user's speech) to speech models stored in one or more vocabularies 40. For a more detailed description of continuous speech recognition, see United States Patent No. 5,202,952, entitled "Large-Vocabulary Continuous Speech Prefiltering and Processing System", which is incorporated by reference.

The recognition software uses common script language interpreter software to communicate with the application 
36 that called the recognition software. The common script language interpreter software enables the user to dictate 
directly to the application either by emulating the computer keyboard and converting the recognition results into appli- 
cation dependent keystrokes or by sending application dependent commands directly to the application using the 
system's application communication mechanism (e.g., Microsoft Windows™ uses Dynamic Data Exchange™). The 
desired applications include, for example, word processors 44 (e.g., WordPerfect™ or Microsoft Word™), spreadsheets 46 (e.g., Lotus 1-2-3™ or Excel™), and games 48 (e.g., Solitaire™).

As an alternative to dictating directly to an application, the user dictates text to a speech recognizer window and, after dictating a document, the user transfers the document (manually or automatically) to the application.

Referring to Fig. 3, when an application first calls the speech recognition software, it is loaded from remote storage (e.g., a disk drive) into the computer's local memory 42. One or more vocabularies, for example, common vocabulary 48 and Microsoft Office™ vocabulary 50, are also loaded from remote storage into memory 42. The vocabularies 48, 50, and 54 include all words 48b, 50b, and 54b (text and commands), and corresponding speech models 48a, 50a, and 54a, that a user may speak.

Spreading the speech models and words across different vocabularies allows the speech models and words to be 
grouped into vendor (e.g., Microsoft™ and Novell™) dependent vocabularies which are only loaded into memory when
an application corresponding to a particular vendor is executed for the first time after power-up. For example, many of 
the speech models and words in the Novell PerfectOffice™ vocabulary 54 represent words only spoken when a user 
is executing a Novell PerfectOffice™ application, e.g., WordPerfect™. As a result, those speech models and words 
are only needed when the user executes a Novell™ application. To avoid wasting valuable memory space, the Novell 
PerfectOffice™ vocabulary 54 is only loaded into memory when needed (i.e., when the user executes a Novell™ 
application). 
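
The load-on-first-use behavior described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class `VocabularyCache`, its method names, and the use of a dictionary from word to a placeholder speech model are all hypothetical and stand in for whatever storage the system actually uses.

```python
from typing import Callable, Dict, Set

Vocabulary = Dict[str, str]  # word -> placeholder for its speech model


class VocabularyCache:
    """Loads each vendor-dependent vocabulary only when first needed."""

    def __init__(self, loaders: Dict[str, Callable[[], Vocabulary]]):
        self._loaders = loaders                    # vendor -> loader function
        self._loaded: Dict[str, Vocabulary] = {}   # vocabularies now in memory

    def on_application_start(self, vendor: str) -> None:
        # Load the vendor's vocabulary the first time one of its
        # applications runs; later starts reuse the copy in memory.
        if vendor in self._loaders and vendor not in self._loaded:
            self._loaded[vendor] = self._loaders[vendor]()

    def active_words(self) -> Set[str]:
        # Words the recognizer can currently match.
        return {word for vocab in self._loaded.values() for word in vocab}
```

A caller would register one loader per vendor and invoke `on_application_start` whenever an application is executed, so that memory is spent only on vocabularies that are actually in use.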

Alternatively, the speech models and words are grouped into application dependent vocabularies. For example, separate vocabularies may exist for Microsoft Word™, Microsoft Excel™, and Novell WordPerfect™. Similarly, the speech models and words corresponding to commands may be grouped into one set of vocabularies while the speech models and words corresponding to text may be grouped into another set of vocabularies. As another alternative, only a single vocabulary including all words, and corresponding speech models, that a user may speak is loaded into local memory and used by the speech recognition software to recognize a user's speech.

Referring to Fig. 4, once the vocabularies are loaded and an application calls the recognition software, the CPU compares speech frames representing the user's speech to speech models in the vocabularies to recognize (step 60) the user's speech. The CPU then determines (steps 62 and 64) whether the results represent a command or text. Commands include single words and phrases and sentences that are defined by templates (i.e., restriction rules). The templates define the words that may be said within command sentences and the order in which the words are spoken. The CPU compares (step 62) the recognition results to the possible command words and phrases and to command templates. If the results match a command word or phrase or a command template (step 64), then the CPU sends (step 65a) the application that called the speech recognition software keystrokes or scripting language that cause the application to execute the command. If the results do not match a command word or phrase or a command template, the CPU sends (step 65b) the application keystrokes or scripting language that cause the application to type the results as text.

For more information on this and other methods of distinguishing between text and commands, see United States Patent Application Serial No. 08/559,207, entitled "Continuous Speech Recognition of Text and Commands", filed the same day and assigned to the same assignee as this application, which is incorporated by reference.

Referring back to Fig. 3, in addition to including words 51 (and phrases) and corresponding speech models 53, 
the vocabularies include application (e.g., Microsoft Word™ 100 and Microsoft Excel™ 102) dependent command 
sentences 48c, 50c, and 54c available to the user and application dependent groups 48d, 50d, and 54d which are 
pointed to by the sentences and which point to groups of variable words in the command templates. 
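
For illustration only, the distinction between commands and text by means of templates and groups of variable words might be sketched in Python as follows. The template contents, the group names in angle brackets, and the functions `matches_template` and `classify` are hypothetical examples invented for this sketch; the actual command sentences are application dependent.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical groups of variable words referenced by templates.
GROUPS: Dict[str, Set[str]] = {
    "<direction>": {"up", "down"},
    "<count>": {"one", "two", "three"},
}

# Hypothetical command templates: fixed words and group slots, in order.
TEMPLATES: List[List[str]] = [
    ["move", "<direction>", "<count>", "lines"],
    ["scratch", "that"],
]


def matches_template(words: Sequence[str], template: Sequence[str]) -> bool:
    """True if every word fits the slot at the same position."""
    if len(words) != len(template):
        return False
    for word, slot in zip(words, template):
        allowed = GROUPS.get(slot, {slot})  # a fixed word allows only itself
        if word not in allowed:
            return False
    return True


def classify(recognized: str) -> Tuple[str, str]:
    """Label recognition results as a command or as text to be typed."""
    words = recognized.lower().split()
    if any(matches_template(words, template) for template in TEMPLATES):
        return ("command", recognized)
    return ("text", recognized)
```

Under these assumed templates, "move up two lines" is classified as a command, whereas "move the lines up" uses the same vocabulary in an order no template permits and is therefore typed as text.
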
Long Term Editing



The long term editing feature provides the user with the flexibility to edit text that was just entered (correctly or incorrectly) into an open document or to open an old document and edit text entered at an earlier time. Referring to Fig. 5, the system first determines (step 130) whether the user has spoken, and if so, the system recognizes (step 132)
the user's speech. The system then determines (step 134) whether the user said "select". If the user [...] "select" is followed by a pause. If "select" is followed by a pause, then the system [...] (step 140) [...] through a standard edit control request to the operating system or through an application program interface (API) [...] consisting of, for example, 180,000 words. As an example, "hello there." is parsed into three words [...]. [...] execute these steps before the select command is issued by the user (e.g., when a document [...] the words on the display screen change) or the system may execute these steps when the select command is [...].

[...] "select" [...] speech recognition results are shown) must match one or more words [...] text (e.g., [...]). Thus, the system compares (step 156) the words of the newly recognized text (e.g., "test") to [...] text [...] recognized text match at least a portion of [...].

If the user does not agree with [...] the system [...] locations on the display screen, then the newly recognized speech matches multiple portions of [...] current cursor position). If the user requests a re-compare, then the system selects the next closest match 312. If the [...] text is not displayed elsewhere on the display screen and the user requests [...] utterance, e.g., "abort", step 164) and exit out of the long term editing feature.

As an example, if the displayed text is "This is a test of speech" and the user says "select test" ("select a test" or "select a test of"), then the system determines that "test" ("a test" or "a test of") matches a portion of the tree structure of Fig. 6 and selects (i.e., highlights) "test" ("a test" or "a test of") on the display screen. If the user disagrees with the selection, [...] or the user may exit out of the selection. If the user agrees with the selection, then the system selects (step 166) the matching [...].

[...] then the system determines that the user was dictating text and not issuing the select command and enters (step 159) "select" and the recognized text on the display screen. For example, if the displayed text is "This is a test of speech" and the user says "select this test", the system determines that the recognized text does not match the tree structure and types "select this test" on the display screen.
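
The matching of a spoken selection against displayed text might be sketched in Python as follows. This is a simplified illustration, not the tree structure of Figs. 6 and 7: it replaces the tree with a direct scan for a contiguous run of words, splits on whitespace only, and resolves multiple matches by distance from a cursor position, which is an assumption drawn from a partially legible passage. The function names and the `cursor_word` parameter are hypothetical.

```python
from typing import List, Optional, Tuple


def find_matches(displayed: List[str], selection: List[str]) -> List[int]:
    """Start indices where `selection` occurs as a contiguous word run."""
    n = len(selection)
    if n == 0:
        return []
    return [
        i for i in range(len(displayed) - n + 1)
        if displayed[i:i + n] == selection
    ]


def handle_select(
    displayed_text: str, utterance: str, cursor_word: int = 0
) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Return ("select", (start, end)) or ("text", None).

    `end` is exclusive. A result of "text" means the whole utterance,
    including the word "select", is to be typed as dictation.
    """
    displayed = displayed_text.lower().split()
    words = utterance.lower().split()
    if not words or words[0] != "select":
        return ("text", None)
    starts = find_matches(displayed, words[1:])
    if not starts:
        return ("text", None)
    # Assumed tie-break: prefer the match nearest the cursor.
    best = min(starts, key=lambda i: abs(i - cursor_word))
    return ("select", (best, best + len(words) - 1))
```

With the displayed text "This is a test of speech", the utterance "select a test of" selects words 2 through 4, while "select this test" matches no contiguous run and is treated as dictated text.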

Because the long term editing feature does not compare speech frames or models of a user's text selection to 
speech frames or models of the previously entered text, the system need not save speech frames for entire documents 
and the user has the flexibility to edit newly entered text in an already open document or to open an old document and 
edit text within that document. The system also does not adapt speech models for edited text when the long term 
editing feature is used because the user's edits may or may not correct speech recognition errors. This substantially 
prevents misadaptation. Furthermore, because the user can simultaneously replace multiple pre-existing words with 
multiple new words, the user may use the long term editing feature to change misrecognized word boundaries. 

Short Term Speech Recognition Error Correction



The short term error correction feature allows the user to correct speech recognition errors in a predetermined number (e.g., three) of the user's last utterances. The correction may simultaneously modify one or more words and correct misrecognized word boundaries as well as train the speech models for any misrecognized word or words. The system only modifies a previous utterance and trains speech models if the user's correction substantially matches speech frames corresponding to at least a portion of the previous utterance. This substantially prevents misadaptation of speech models by preventing the user from replacing previously entered text with new words using the short term error correction feature.
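
The training rules stated for the three correction/editing features can be summarized in a short Python sketch. The enumeration `EditKind` and the function `should_train` are hypothetical names, and the boolean `acoustic_match` stands in for the outcome of the comparison between the correction and the stored speech frames, which is not modeled here.

```python
from enum import Enum


class EditKind(Enum):
    SHORT_TERM_CORRECTION = "oops"
    LONG_TERM_EDIT = "select"
    SCRATCH_THAT = "scratch that"


def should_train(
    kind: EditKind, acoustic_match: bool, user_requested: bool = False
) -> bool:
    """Decide whether speech models are adapted after a modification."""
    if kind is EditKind.SHORT_TERM_CORRECTION:
        # Train only when the correction substantially matches stored
        # speech frames of a recent utterance.
        return acoustic_match
    if kind is EditKind.LONG_TERM_EDIT:
        # No automatic training; the user may ask for it explicitly.
        return user_requested
    # "Scratch that" replacement text never trains speech models.
    return False
```

The sketch makes the design choice explicit: adaptation happens automatically only where the system itself has evidence that a misrecognition occurred.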

Referring to Figs. 9 and 10a-10e, when a user determines that a speech recognition error 320 has occurred within the last three utterances, the user may say "Oops" 322 (Fig. 10b) or type keystrokes or make a mouse selection of a correction window icon. When the system determines (step 178) that the user has issued the oops command, the system displays (step 180) a correction window 182 (Fig. 10c) on display screen 136 and displays (step 183) the last utterance 184 in a correction sub-window 186. The system then determines (step 188) whether the user has input (by keystroke or utterance) corrected text (e.g., "This" 324, Fig. 10d). For example, if the user said "This ability to talk fast" and the system recognized "Disability to talk fast", the user may say "oops" and then repeat or type "This" (or "This ability" or "This ability to talk", etc.).

If the system determines (step 190) that the user spoke the corrected text, then the system recognizes (step 192) the user's speech. Instead of providing words as corrected text, the user may enter (by keystroke, mouse selection, or utterance, e.g., "spell that", Fig. 11) a spelling command followed by the letters of the words in the corrected text. After determining that the user entered the spelling command, the system displays a spelling window 194. The system then recognizes the letters 196 spoken or typed by the user and provides a choice list 197 corresponding to the recognized letters. For more information regarding the spelling command and speech recognition of letters, see United States Patent Application Serial No. 08/521,543, entitled "Speech Recognition", filed August 30, 1995, and United States Patent Application Serial No. 08/559,190, entitled "Speech Recognition", filed the same day and assigned to the same assignee as this application.

Referring also to Fig. 12, whether the user types or speaks the corrected text, the system builds (step 198) a tree
structure (e.g., 200) for each of the last three utterances using the speech frames corresponding to these utterances 
and the speech frames (if spoken) or speech models (if typed) corresponding to the corrected text. The system then 
re-recognizes (step 202) each of the last three utterances against the corresponding tree structure to determine (step 
204) if at least a portion of the speech frames in the corresponding utterance substantially match the speech frames 
or models corresponding to the corrected text. Each state 210-220 in the tree structure includes one or more speech 
frames corresponding to a previously recognized word in the utterance, the remaining speech frames in the utterance, 
and the speech frames or models corresponding to a first recognized word in the corrected text. 

For example, if the user says "Let's recognize speech" and the system recognizes "Let's wreck a nice beach", the user may say "oops" to call up the correction window and say "recognize" as the corrected text. State 210 includes all of the speech frames of the utterance and the speech frames corresponding to "recognize", while state 216 includes only the speech frames corresponding to "nice", the remaining speech frames of the utterance (e.g., "beach"), and the speech frames corresponding to "recognize". State 220 includes only the speech frames corresponding to "recognize" to prevent the system from reaching final state 222 before at least a portion of the speech frames in the utterance are found to substantially match the speech frames corresponding to "recognize".

If the system determines that the initial speech frames of the utterance best match the speech models in the system vocabulary for the word "let's", then the system determines whether the next speech frames best match "wreck" or "recognize". If the system determines that the speech frames best match "wreck", the system determines whether the next speech frames best match "a" or "recognize". The system makes this determination for each of the originally recognized words in the utterance.

During re-recognition, the system determines which path (from state 210 to 222) has the highest speech recognition score. Initially, the system is likely to reach state 220 after re-recognizing the original utterance as it originally did, i.e., "let's wreck a nice beach". After reaching state 220, however, the system cannot match any remaining speech frames [...] the score for this path is very low and the system disregards this path [...] scoring path is "let's recognize speech" (as opposed to other possible [...]

[...] found, then the system transitions to final state 222 and [...] that a correct match will be found. [...] the corrected text the user provides multiple words [...]

For example, instead of [...] of speech frames that the system must store.
Scratch That and Repeat

[...] misadaptation of speech models.

Other embodiments are within the scope of the following claims.

For example, instead of having a digital signal processor (DSP) process the samples corresponding to each speech
[...] models are adapted. If the system determines that the user corrected a speech recognition error, then the system trains
the speech models accordingly.

Many optimizations to improve speech recognition performance are possible. For example, typed text cannot cause 
speech recognition errors, and, as a result, during short term error correction re-recognition (step 202, Fig. 10) when 
the system is re-recognizing the remaining speech frames against the system vocabulary (state 222, Fig. 12), the 
system may increase the speech recognition score for words matching text that the user entered through keystrokes. 
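
As a rough illustration of this optimization, the sketch below adds a bonus to the score of any candidate word that matches typed text. The function name `boost_typed_words`, the `bonus` value, and the convention that a higher score is a better match are all assumptions made for the example, not details taken from the description.

```python
def boost_typed_words(scored_words, typed_words, bonus=1.0):
    """Return (word, score) pairs with a bonus added for typed words.

    Assumes a higher score means a better match; `bonus` is an arbitrary
    illustrative value.
    """
    typed = {word.lower() for word in typed_words}
    return [
        (word, score + bonus if word.lower() in typed else score)
        for word, score in scored_words
    ]


# Made-up scores for two competing candidates.
candidates = [("wreck", 4.0), ("recognize", 3.5)]
boosted = boost_typed_words(candidates, ["recognize"])
best_word = max(boosted, key=lambda pair: pair[1])[0]
```

With these made-up scores the typed word overtakes the originally preferred candidate.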

Pseudo-Code 

Following is pseudo-code derived from C Programming Language Code that describes the process for Long Term 
Editing and Short Term Speech Recognition Error Correction: 
Long Term Editing

start:

    wait for start of speech
    start recognition of speech
    if first word of the recognition is "select"
        build-the-select-grammar
        recognize the utterance against the select-grammar
        if the recognition matches the select-grammar
            search-for-the-indicated-words
            remember the utterance and recognition results as
                last-select-result
            goto start
        otherwise,
            interpret recognition as text
            type-text-on-the-screen
            delete the last-select-result
            goto start
    otherwise,
        if the recognition matches "try again" and there is a
                last-select-result
            search-for-the-indicated-words in the last-select-result
            if the words found by the search are not the exact
                    same occurrences which were first selected by this
                    transcription of the results
                goto start
            otherwise,
                change the last-select-result to the next best
                    unused transcription of the utterance saved in
                    last-select-result
                if there are no more unused transcriptions in
                        last-select-result
                    goto start
                otherwise,
                    search-for-the-indicated-words in the next best
                        transcription
                    goto start
        otherwise,
            continue recognition
            type-text-on-the-screen
            delete the last-select-result
            goto start

search-for-the-indicated-words:

    set the current word to be the word on the screen just
        before the selection

loop:

    if the text on the screen starting with the current
            word matches the indicated words
        set the selection to text on the screen just compared against
        return from subroutine
    otherwise,
        if the current word is the first word on the screen
            set the current word to be the last word on the screen
        otherwise,
            change the current word to be the word on the
                screen before the current word
        then,
            if the current word is the first word in the selection
                return from subroutine
            otherwise,
                goto loop
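
The backward, wrap-around search in search-for-the-indicated-words can be sketched in Python as follows. This is a minimal illustration, assuming the screen is available as a list of words and the selection is identified by the index of its first word; the function name and parameters are hypothetical.

```python
def search_for_indicated_words(screen_words, indicated, selection_start):
    """Search backward from just before the selection, wrapping around.

    Returns the (start, end) word indices of the new selection, or None
    if the search comes back to the selection without finding a match.
    """
    n = len(screen_words)
    indicated = list(indicated)
    if n == 0 or not indicated:
        return None
    start = selection_start % n      # first word of the current selection
    current = (start - 1) % n        # word just before the selection
    while True:
        end = current + len(indicated)
        if screen_words[current:end] == indicated:
            return (current, end)
        current = (current - 1) % n  # step back, wrapping to the last word
        if current == start:
            return None              # back at the selection: nothing found


screen = "the cat sat on the mat".split()
first_hit = search_for_indicated_words(screen, ["the"], 5)
second_hit = search_for_indicated_words(screen, ["the"], first_hit[0])
```

Calling the function again with the previous hit as the selection moves to the next earlier occurrence, which is how repeated searches step backward through the text in this sketch.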

type-text-on-the-screen:

    if words are selected on the screen
        delete the words which are selected
        leave the insertion point at the point where words
            were deleted
        type the text at the current insertion point
    otherwise,
        type the text at the current insertion point

build-the-select-grammar:

    create a state with the word "select"
    create a large state which will hold all the words
    add a transition from the word "select" to the large state
    set the last-small-state variable to null
    set the last-word-in-large-state variable to null
    read the screen into a buffer
    parse the buffer into a series of words
    for each word in the buffer
        look the word up in the dictionary to get a speech model
        if the word is not in the dictionary
            try to create a speech model for this word by
                generating a pronunciation using text to
                speech synthesis rules
            if no speech model can be created for this word
                skip this word
                set the last-small-state variable to null
                set the last-word-in-large-state variable to null
                continue with the next word in the buffer
        then,
            create a small state containing only this word
            if the last-small-state variable is not null
                add a transition from the last-small-state to this
                    new state
            set the last-small-state variable to be this newly
                created small state
            if the last-word-in-large-state variable is not null
                add a transition from the last-word-in-large-state
                    to this new state
            if the word is not in the large state
                add the word to the large state
                set the last-word-in-large-state variable to this
                    new word
                continue with the next word in the buffer
            otherwise,
                set the last-word-in-large-state variable to the
                    existing occurrence of the word in the large
                    buffer
                continue with the next word in the buffer
    if there are no more words in the buffer
        return from subroutine
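
The grammar built by build-the-select-grammar can be sketched at the word level as shown below. This is an illustration only: it works on text rather than speech models, `has_model` is a hypothetical stand-in for the dictionary lookup and text-to-speech fallback, and the `accepts` helper is an added assumption used to show what the resulting grammar allows.

```python
def build_select_grammar(screen_text, has_model=lambda word: True):
    """Build a word-level select grammar from the text on the screen."""
    small = []        # one small state per usable word occurrence
    transitions = []  # (from small state, to small state) index pairs
    large = {}        # word in the large state -> small states it leads to
    last_small = None
    last_large_word = None
    for word in screen_text.split():
        if not has_model(word):
            # Skip the word and break the chain on both sides of it.
            last_small = None
            last_large_word = None
            continue
        small.append(word)
        this_small = len(small) - 1
        if last_small is not None:
            transitions.append((last_small, this_small))
        last_small = this_small
        if last_large_word is not None:
            large[last_large_word].append(this_small)
        large.setdefault(word, [])  # add the word only if it is new
        last_large_word = word
    return {"small": small, "transitions": transitions, "large": large}


def accepts(grammar, phrase):
    """Check whether 'select <words>' names consecutive words on the screen."""
    words = phrase.split()
    if len(words) < 2 or words[0] != "select" or words[1] not in grammar["large"]:
        return False
    # Small states that may follow the first selected word.
    frontier = set(grammar["large"][words[1]])
    for word in words[2:]:
        matched = {s for s in frontier if grammar["small"][s] == word}
        if not matched:
            return False
        frontier = {b for a, b in grammar["transitions"] if a in matched}
    return True


grammar = build_select_grammar("the cat sat on the mat")
```

Because the large state holds each distinct word once while the small states hold every occurrence, a repeated word such as "the" can be followed by either "cat" or "mat" in this sketch.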

Short Term Speech Recognition Error Correction

start:

    wait for speech
    recognize the speech
    remember the utterance in a four element
        first-in-first-out (FIFO) queue
    if utterance is not "oops"
        perform the indicated command or type the recognized text
        goto start
    otherwise,
        concatenate the results from the last four utterances
            in the FIFO queue into a single long string
        display a correction dialog box with two fields, the
            first field should be blank and the second
            field should contain the concatenated results
        goto loop
loop:

    wait for speech or another user action
    if more than 2 seconds have elapsed since the
            contents of the first field in the dialog have changed
        recompute-the-correction
        goto loop
    otherwise,
        if speech is detected and the speech recognized "press
                OK" or the user clicks the mouse on the OK
                button, or the user presses the enter key
            if the contents of the first field in the dialog have
                    changed since the correction was last recomputed
                recompute-the-correction
            then,
                if there is a corrected utterance
                    update-the-original-document
            then,
                destroy the correction dialog
                goto start
        otherwise,
            if speech is detected and the speech recognized "press
                    cancel" or the user clicks the mouse on the
                    Cancel button, or the user presses the escape key
                destroy the correction dialog
                goto start
            otherwise,
                if speech is detected
                    recognize the speech
                    enter the recognized text into the first field of
                        the dialog
                    record that the first field of the dialog has
                        changed
                    goto loop
                otherwise,
                    if the user starts typing
                        enter the typed keystrokes into the first field
                            of the dialog
                        record that the first field of the dialog has
                            changed
                        goto loop
                    otherwise,
                        goto loop
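
The four element FIFO queue used by the start routine can be sketched with a bounded deque. This is a minimal illustration; the class name `UtteranceHistory` and its methods are hypothetical, and the sketch stores recognized text only, whereas the system also keeps the speech frames of each utterance.

```python
from collections import deque


class UtteranceHistory:
    """Keep the most recent utterances, discarding the oldest first."""

    def __init__(self, size=4):
        self._queue = deque(maxlen=size)

    def remember(self, recognized_text):
        # Appending to a full deque drops the oldest entry automatically.
        self._queue.append(recognized_text)

    def concatenated(self):
        # Join the remembered results into a single long string.
        return " ".join(self._queue)


history = UtteranceHistory()
for text in ["one", "two", "three", "four", "five"]:
    history.remember(text)
```

After five utterances only the last four remain, so the string shown in the second field of the dialog would be built from those four in this sketch.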

update-the-original-document:

    find the corrected utterance in the original document
    remove the original text of the corrected utterance
    replace the original text with the corrected text
    return from subroutine

recompute-the-correction:

    read the contents of the first field of the dialog into
        a buffer
    parse the buffer into a series of words
    for each word in the buffer
        look the word up in the dictionary to get a speech model
        if the word is not in the dictionary
            try to create a speech model for this word by
                generating a pronunciation using text to
                speech synthesis rules
            if no speech model can be created for this word
                display an "unknown word" error to the user
                return from subroutine
        otherwise,
            remember these words as the target words
    then,
        for each utterance in the FIFO queue
            compute-a-possible-correction for this utterance and
                the target words
            record the score of this possible correction and the
                correction itself
    then,
        compute the maximum score of all computed possible
            corrections
        if the maximum score is zero
            display "utterance can not be corrected" error to the user
            return from subroutine
        otherwise,
            remember the highest scoring computed possible
                correction as the corrected utterance
            concatenate the results from the last four utterances
                in the FIFO queue into a single long string
            replace the results for the corrected utterance with
                the computed possible correction
            replace the second field with the corrected
                concatenated string
            highlight the words in the corrected results which
                correspond to the words in the first field of
                the dialog box
            return from subroutine
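
The selection step of recompute-the-correction (score every utterance in the queue, keep the best, and treat a maximum score of zero as failure) can be sketched as follows. The function and parameter names are hypothetical, the utterances are represented by their recognized text only, and `toy_scorer` is a stand-in with made-up scores rather than a real recognizer.

```python
def recompute_correction(utterances, target_words, compute_possible_correction):
    """Pick the highest scoring possible correction across the queue.

    Returns (index of corrected utterance, corrected concatenated string),
    or None when every possible correction scores zero.
    """
    best_score, best_index, best_text = 0, None, None
    for index, utterance in enumerate(utterances):
        score, text = compute_possible_correction(utterance, target_words)
        if score > best_score:
            best_score, best_index, best_text = score, index, text
    if best_index is None:
        return None  # corresponds to "utterance can not be corrected"
    corrected = list(utterances)
    corrected[best_index] = best_text
    return best_index, " ".join(corrected)


def toy_scorer(utterance, target_words):
    # Stand-in for compute-a-possible-correction: a table of made-up results.
    table = {"let's wreck a nice beach": (0.9, "let's recognize speech")}
    return table.get(utterance, (0, None))


queue = ["hello there", "let's wreck a nice beach", "thank you"]
result = recompute_correction(queue, ["recognize"], toy_scorer)
```

Passing the scorer in as a parameter keeps the selection logic separate from recognition, which is a design choice of this sketch rather than something the pseudo-code prescribes.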

compute-a-possible-correction:

    create-a-correction-grammar using the utterance and the
        target words
    recognize the utterance against the correction grammar
    look in the results for the target words
    if the target words do not appear in the results
        return 0
    otherwise,
        record the results of the recognition as a possible
            correction
        return the score from the recognition

create-a-correction-grammar:

    set the last-target-word to NULL
    for every target word
        create a small state containing the next target word
        if the last-target-word is not NULL
            add a transition from the last-target-word to this
                new small state
        set the last-target-word equal to the current target word
    then,
        add a transition from the last-target-word to the state
            of all words in the vocabulary
    set the last-original-word to NULL
    for every word in the original recognition results
        create a small state containing the next word in the
            original results
        if the last-original-word is not NULL
            add a transition from the last-original-word to
                this new small state
        then,
            if the current word in the original recognition
                    results is not the same as the first target word
                add the first target word to this state
        then,
            if there is only one target word
                add a transition from the first target word in this
                    new small state to the state of all words in
                    the vocabulary
            otherwise,
                add a transition from the first target word in this
                    new small state to the small state created
                    earlier which contains the second target word
        then,
            set the last-original-word equal to the current word
                in the original results
    then,
        add a transition from the last-original-word to the
            small state created earlier which contains the
            first target word
    return from subroutine
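
The structure produced by create-a-correction-grammar can be sketched as a small graph. This is an illustration under stated assumptions: the state names (`T0`, `O0`, `VOCAB`) and the dictionary layout are hypothetical, and states hold words rather than speech frames or speech models.

```python
def create_correction_grammar(original_words, target_words):
    """Return (states, edges) for a word-level correction grammar.

    states: state name -> list of words the state can match
    edges:  (state name, matched word) -> list of next state names
    """
    if not target_words:
        raise ValueError("at least one target word is required")
    states = {"VOCAB": ["<any vocabulary word>"]}
    edges = {}

    def add_edge(state, word, next_state):
        edges.setdefault((state, word), []).append(next_state)

    # Chain of small states, one per target word, ending in VOCAB.
    for k, word in enumerate(target_words):
        states[f"T{k}"] = [word]
        if k > 0:
            add_edge(f"T{k - 1}", target_words[k - 1], f"T{k}")
    add_edge(f"T{len(target_words) - 1}", target_words[-1], "VOCAB")

    first = target_words[0]
    after_first = "VOCAB" if len(target_words) == 1 else "T1"

    # Chain of small states, one per originally recognized word.
    for j, word in enumerate(original_words):
        states[f"O{j}"] = [word]
        if j > 0:
            add_edge(f"O{j - 1}", original_words[j - 1], f"O{j}")
        if word != first:
            states[f"O{j}"].append(first)
        add_edge(f"O{j}", first, after_first)
    if original_words:
        last = len(original_words) - 1
        add_edge(f"O{last}", original_words[last], "T0")
    return states, edges


original = ["let's", "wreck", "a", "nice", "beach"]
states, edges = create_correction_grammar(original, ["recognize"])
```

In this sketch recognition would start at `O0`; each original-word state offers the first target word as an alternative, and taking that alternative leads either to the remaining target words or directly to the state of all vocabulary words.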



Claims 

1. A method for use in recognizing speech comprising:

accepting signals corresponding to interspersed speech elements including text elements corresponding to
text to be recognized and command elements to be executed,
recognizing the elements, and

executing modification procedures in response to recognized predetermined ones of the command elements,
including:

refraining from training speech models when the modification procedures do not correct a speech
recognition error.

2. A method for use in recognizing speech comprising: 
accepting signals corresponding to interspersed speech elements including text elements corresponding to
text to be recognized and command elements to be executed,
recognizing the elements, and

executing modification procedures in response to recognized predetermined ones of the command elements,
including:

simultaneously modifying previously recognized ones of the text elements. 

3. The method of claim 2 in which simultaneously modifying previously recognized text elements includes simulta- 
neously modifying text element boundaries of the previously recognized ones of the text elements. 

4. The method of claim 3 in which the text element boundaries were misrecognized. 

5. The method of claim 1 in which executing the modification procedures includes: 

detecting a speech recognition error, and

training speech models in response to the detected speech recognition error. 

6. The method of claim 5 in which detecting further includes:
determining whether speech frames or speech models corresponding to a speech recognition modification
match at least a portion of the speech frames or speech models corresponding to previous utterances.

7. The method of claim 6 further including:

selecting matching speech frames or speech models.

8. The method of claim 1 in which the predetermined ones of the command elements include a select command.

9. The method of claim 1 in which the command elements include an utterance representing a selected recognized
text element to be corrected.

10. The method of claim 9 in which the modification procedures include matching the selected recognized text element
against previously recognized text elements.

11. The method of claim 6 in which the modification procedures include parsing previously recognized text elements
and building a tree structure that represents the ordered relationship among the previously recognized text
elements.

12. The method of claim 11 in which the tree structure reflects multiple occurrences of a given previously recognized
one of the text elements.

13. The method of claim 6 in which the utterance represents a sequence of multiple selected recognized text elements.

14. The method of claim 1 in which the modification procedures include 
modifying one of the recognized text elements. 

15. The method of claim 14 in which the modifying is based on correction information provided by a user.

16. The method of claim 15 in which the correction information is provided by the user speaking substitute text
elements.

17. The method of claim 16 in which the correction information includes correction of boundaries between text
elements.

18. The method of claim 1 in which the modification procedures include modifying one or more of the most recently 
recognized text elements. 


19. The method of claim 13 in which the predetermined ones of the command elements include a command indicating 
that a short term correction is to be made. 

20. The method of claim 19 in which the command comprises "oops".

21. The method of claim 18 in which the modification procedures include interaction with a user with respect to
modifications to be made.

22. The method of claim 21 in which the interaction includes a display window in which proposed modifications are 
indicated. 

23. The method of claim 21 in which the interaction includes a user uttering the spelling of a word to be corrected. 

24. The method of claim 18 in which the modification procedures include building a tree structure grouping speech
frames corresponding to possible text elements in branches of the tree.

25. The method of claim 24 in which the modification procedures include re-recognizing the most recently recognized
text elements using the speech frames of the tree structure.

26. The method of claim 24 in which the tree is used to determine, text element by text element, a match between a 
correction utterance and the originally recognized text elements. 

27. The method of claim 26 in which the modification procedures include, after determining a match, re-recognizing
subsequent speech frames of an original utterance.

28. The method of claim 26 in which, if no match is determined, the recognized correction utterance is displayed to
the user.

29. The method of claim 1 in which the command indicates that the user wishes to delete a recognized text element.

30. The method of claim 29 in which the text element is the most recently recognized text element.

31. The method of claim 29 in which the command comprises "scratch that".

32. The method of claim 29 in which the command is followed by a pause and the most recently recognized text 
element is then deleted. 

33. The method of claim 29 in which the command is followed by an utterance corresponding to a substitute text
element and the substitute text element is then substituted for the most recently recognized text element.